LASSO, which stands for Least Absolute Shrinkage and Selection Operator, is a model complexity control technique, alongside variable selection and ridge regression. In this notebook we'll demonstrate how to use the glmnet package to fit a LASSO regression. For more information about LASSO, you can refer to the LASSO Page.
This notebook is aimed at data scientists who understand linear regression and want to learn how to fit a LASSO regression in R. An operationalization step is also included to show how you can deploy a web service in Azure based on the selected model.
In this example, we'll use the housing data from the R package MASS. The dataset has 506 rows and 14 columns; available information includes median home price, average number of rooms per dwelling, crime rate by town, and so on. More information about this dataset can be found at UCI or by typing help(Boston) in an R terminal.
In [ ]:
library(MASS) # to use the Boston dataset
?Boston
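As a quick, optional check of the dimensions quoted above, you can inspect the data frame directly:
In [ ]:
dim(Boston)   # 506 rows, 14 columns
head(Boston)  # preview the first few observations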
For illustration purposes, we'll use "medv" - median home price - as the response variable and the remaining variables as predictors.
The first step in fitting a LASSO regression is to determine the value of the tuning parameter λ, which controls the overall strength of the penalty. Here we'll use cross-validation to choose the λ that gives the smallest cross-validated error.
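For reference, with family = "gaussian" and alpha = 1, glmnet estimates the coefficients by minimizing the penalized residual sum of squares

$$\min_{\beta_0,\ \beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert,$$

where larger values of λ shrink more coefficients exactly to zero.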
In [ ]:
# to make results replicable
set.seed(123)
# load libraries
if(!require("glmnet", quietly = TRUE)) install.packages("glmnet")
library(glmnet) # to fit a LASSO model
library(MASS) # to use the Boston dataset
In [ ]:
# define response variable and predictor variables
response_column <- which(colnames(Boston) %in% c("medv"))
train_X <- data.matrix(Boston[, -response_column])
train_y <- Boston[,response_column]
# use cv.glmnet with 10-fold cross-validation to pick lambda
model1 <- cv.glmnet(x = train_X,
                    y = train_y,
                    alpha = 1,           # alpha = 1 corresponds to the LASSO penalty
                    nfolds = 10,
                    family = "gaussian",
                    type.measure = "mse")
In [ ]:
summary(model1)
options(repr.plot.width = 5, repr.plot.height = 5)
plot(model1)
In the plot above, the red dots show the mean cross-validated error for each value of λ (on the log scale), with error bars of one standard error. The numbers along the top axis indicate how many variables have nonzero coefficients at each λ, and the two vertical dotted lines mark lambda.min and lambda.1se.
Lambda that gives minimum mean cross-validated error:
In [ ]:
round(model1$lambda.min, 4)
Largest lambda with mean cross-validated error within 1 standard error of the minimum error:
In [ ]:
round(model1$lambda.1se, 4)
Coefficients based on lambda that gives minimum mean cross-validated error:
In [ ]:
coef(model1, s = "lambda.min")
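The cross-validated error at these two values of λ can also be read directly from the fitted object. The snippet below is a small optional check using the lambda and cvm components documented in ?cv.glmnet:
In [ ]:
# cross-validated MSE at lambda.min and lambda.1se
data.frame(
    lambda = c(model1$lambda.min, model1$lambda.1se),
    cv_mse = model1$cvm[match(c(model1$lambda.min, model1$lambda.1se), model1$lambda)]
)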
While the model selected by cv.glmnet() can be used for making predictions, we also want to better understand how the value of λ affects the estimated coefficients. Such information can be produced by the glmnet() function, which fits the model over a whole sequence of λ values. In the plot generated below, you can see that the coefficients shrink toward zero as λ increases.
In [ ]:
model2 <- glmnet(x = train_X,
                 y = train_y,
                 alpha = 1,
                 family = "gaussian")

# variable names and their column indices (used as plot labels)
vn  <- colnames(train_X)
vid <- as.character(seq(1, length(vn)))

# coefficients at the last (smallest) lambda on the path, without the intercept;
# exclude the variables whose coefficient is exactly 0
vnat   <- coef(model2)
vnat_f <- vnat[-1, ncol(vnat)]
vid <- vid[vnat_f != 0]
vn  <- vn[vnat_f != 0]

# define the legend description, line type, and line color
nvars <- length(vn)
legend_desc <- paste(vid, vn, sep = ": ")
legend_desc
mylty <- rep(1, nvars)
mycl  <- seq(1, nvars)
By increasing the value of lambda, the regression parameters are increasingly penalised and thus move closer to zero.
In the lambda plot below, you can see how the coefficients gradually shrink in value as lambda increases. This has a particularly strong impact on variable 5 (nox, nitrogen oxides concentration in parts per 10 million).
This illustrates one of the strengths of LASSO regression: the algorithm deals well with problematic predictors, for example situations where the predictors are highly correlated with one another (multicollinearity). A quick correlation check after the plot below illustrates this.
In [ ]:
# plot
options(repr.plot.width = 6, repr.plot.height = 6)
plot(model2, xvar = "lambda", label = TRUE, col = mycl, xlim = c(-5.5, 2))
legend(-0.5,-2, legend_desc, lty = mylty, col = mycl, cex = 0.8)
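For that quick check on the multicollinearity point, you can look at the pairwise correlations among the predictors; nox, in particular, is strongly correlated with several of the other variables. This step is optional and not part of the original workflow:
In [ ]:
# pairwise correlations among the predictors (rounded for readability)
round(cor(train_X), 2)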
The coefficients from this model at the optimal λ are also printed below. As we would expect, they are the same as those obtained from cv.glmnet().
In [ ]:
coef(model2, s = model1$lambda.min)
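If you want to verify this programmatically, the two sets of coefficients can be placed side by side; this is a quick optional check, not part of the original workflow:
In [ ]:
# compare the coefficients from the cross-validated fit and the full-path fit
cbind(cv_fit   = as.vector(coef(model1, s = "lambda.min")),
      path_fit = as.vector(coef(model2, s = model1$lambda.min)))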
To make predictions, you can use either of the two models:
In [ ]:
x_new <- train_X  # the predictor matrix already excludes the response column medv
pred1 <- predict(model1, newx = x_new, s = "lambda.min")
pred2 <- predict(model2, newx = x_new, s = model1$lambda.min)
head(
data.frame(actual = Boston$medv, model1 = as.vector(pred1), model2 = as.vector(pred2))
)
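To get a rough sense of the in-sample fit, you can also compute the root mean squared error of these predictions on the training data; keep in mind this is an optimistic estimate, since it is evaluated on the same data used to fit the model:
In [ ]:
# in-sample RMSE of the predictions at lambda.min
sqrt(mean((Boston$medv - as.vector(pred1))^2))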
In [ ]:
# define predict function
mypredict <- function(newdata){
    # drop the response column if it is present in the new data
    if("medv" %in% names(newdata)) {
        w <- match("medv", names(newdata))
        newdata <- newdata[, -w]
    }
    require(glmnet)
    newdata <- data.matrix(newdata) # the prediction data needs to be a matrix for glmnet
    predict(model2, newx = newdata, s = model1$lambda.min)
}
# test the prediction function
newdata <- Boston[1:10, ]
mypredict(newdata)
In [ ]:
# load the library
library(AzureML)
# workspace information; with no arguments, the AzureML package reads the
# workspace ID and authorization token from its settings file
ws <- workspace()
# deploy the service
ep <- publishWebService(ws = ws, fun = mypredict,
                        name = "LASSOPrediction",
                        inputSchema = newdata,
                        outputSchema = list(ans = "numeric"))
str(ep)
In [ ]:
newdata <- Boston[1:10, ]
pred <- consume(ep, newdata)$ans
data.frame(actual = newdata$medv, prediction = pred)
Using the Boston housing dataset, we demonstrated how to carry out a LASSO regression analysis. We started the analysis by determining the optimal value of the tuning parameter λ using cross-validation, then examined the impact of λ on the coefficient estimates, and finally deployed a web service based on the selected model.
Created by a Microsoft Employee.
Copyright (C) Microsoft. All Rights Reserved.